Generalized Linear Models II

Chelsea Parlett-Pelleriti

Applications/Extensions of GLMs

Mixed Effect Models

Mixed Effect Models

\[ \underbrace{\hat{y} = \beta_0 + \beta_1x_1}_\text{simple linear regression}\\ \]

Let’s say we’re predicting the effect of fundraising (\(x_i\): number of dollars raised) on standardized test scores (\(y_i\)). But let’s say that we are measuring students from 5 different schools. Could each school have a different baseline (\(\beta_0\)) standardized test score?

Mixed Effect Models

\[ \hat{y} = \beta_{0[j]} + \beta_1x_1 \]

Option: fit a separate intercept for each school (\(\beta_{0[j]}\)).

But…what if we think each school might have a different effect (e.g. schools with less funding might benefit more from fundraising).

Mixed Effect Models

\[ \hat{y} = \beta_{0[j]} + \beta_{1[j]}x_1 \]

Option: fit a separate intercept (\(\beta_{0[j]}\)) and slope (\(\beta_{1[j]}\)) for each school.

Great! But now we have two problems:

  • we’re throwing away information: if school A’s slope is 0.01, then school B’s slope is probably not 10,000

  • we can’t generalize our findings to other schools, only the ones we measured

Mixed Effect Models

💡 let’s assume that the school effects \(\beta_{0[j]}\) and \(\beta_{1[j]}\) are drawn from a population distribution of effects

\[ \beta_{0[j]} \sim \mathcal{N}\left(\mu_0, \sigma_0 \right) \\ \beta_{1[j]} \sim \mathcal{N}\left(\mu_1, \sigma_1 \right) \]

Now:

  • we can generalize our findings to other schools by learning about \(\mu_0, \sigma_0\) and \(\mu_1, \sigma_1\)
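This assumption is easy to sketch in code. The population means and SDs below are made-up values purely for illustration; in a real mixed effect model they are estimated from the data rather than fixed in advance:

```r
set.seed(1)

# hypothetical population-level parameters (made-up values for illustration)
mu_0 <- 500; sigma_0 <- 30    # distribution of school baseline scores
mu_1 <- 0.05; sigma_1 <- 0.02 # distribution of fundraising effects

# draw an intercept and slope for each of 5 schools
beta_0 <- rnorm(5, mean = mu_0, sd = sigma_0)
beta_1 <- rnorm(5, mean = mu_1, sd = sigma_1)

# predicted test score for school j at a given fundraising level
predict_school <- function(j, dollars) beta_0[j] + beta_1[j] * dollars
predict_school(1, 1000)
```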

Mixed Effect Models

\[ \hat{y} = \underbrace{\beta_{0[j]}}_\text{RE} + \underbrace{\beta_1x_1}_\text{FE} \]

  • Random Effects: parameter values are drawn randomly from a population with mean effect \(\mu\) and standard deviation \(\sigma\)

    • we are not observing all the levels of the variable we’re interested in generalizing to (e.g. subjects)
  • Fixed Effects: parameter values are fixed

    • we are observing all the levels of the variable we’re interested in (e.g. treatment/control, year in school, country of origin)

Note: Mixed effect models are called that because we have both Random + Fixed effects

Sleep Study Data

library(lme4)
library(dplyr)
library(tibble)

sleepstudy <- sleepstudy %>% 
  as_tibble() %>% 
  mutate(Subject = as.character(Subject))

# Add two fake participants
df_sleep <- bind_rows(
  sleepstudy,
  tibble(Reaction = c(286, 288), Days = 0:1, Subject = "374"),
  tibble(Reaction = 245, Days = 0, Subject = "373"))

df_sleep %>% glimpse()
Rows: 183
Columns: 3
$ Reaction <dbl> 249.5600, 258.7047, 250.8006, 321.4398, 356.8519, 414.6901, 3…
$ Days     <dbl> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 0…
$ Subject  <chr> "308", "308", "308", "308", "308", "308", "308", "308", "308"…

modified from the insanely good blog by TJ Mahr

Sleep Study Data

Complete Pooling

All subjects are pooled together, they share all information.

\[ \hat{rt} = \beta_0 + \beta_1 * days \]

# A tibble: 3 × 4
  Model            Subject Intercept Slope_Days
  <chr>            <chr>       <dbl>      <dbl>
1 Complete pooling 308          252.       10.3
2 Complete pooling 309          252.       10.3
3 Complete pooling 310          252.       10.3

No-Pooling

\[ \hat{rt} = \beta_{0[j]} + \beta_{1[j]} * days \]

All subjects are kept completely separate, no information is shared.

  Subject Intercept Slope_Days      Model
1     308  244.1927  21.764702 No pooling
2     309  205.0549   2.261785 No pooling
3     310  203.4842   6.114899 No pooling

Complete vs. No Pooling

Partial Pooling

\[ \hat{rt} = \beta_{0[j]} + \beta_{1[j]} * days \\ \beta_{0[j]} \sim \mathcal{N}(\mu_0, \sigma_0); \beta_{1[j]} \sim \mathcal{N}(\mu_1, \sigma_1) \]

All subjects can learn from each other, but separation is preserved.

# A tibble: 3 × 4
  Subject Intercept Slope_Days Model          
  <chr>       <dbl>      <dbl> <chr>          
1 308          254.      19.6  Partial pooling
2 309          212.       1.73 Partial pooling
3 310          213.       4.91 Partial pooling

Partial Pooling

Partial Pooling

Regularization Towards Group Mean

Random Intercepts

y ~ x + (1 | pid)

Random Slopes

y ~ x + (x | pid)

Random Intercepts + Slopes

y ~ x + (1 + x | pid)

RE Covariance Structure

Mixed Effect Regularization

\[ \beta_{0[j]} \approx w * \left( \bar{y_j} - \beta\bar{x_j}\right) + (1-w) * \mu_0 \\ w = \frac{\frac{n_j}{\sigma^2_{y}}}{\frac{n_j}{\sigma^2_{y}} + \frac{1}{\sigma^2_{\beta_0}}} \]

💡 when will \(w\) be large? when will it be small?

Mixed Effect Regularization

\[ \beta_{0[j]} \approx w * \left( \bar{y_j} - \beta\bar{x_j}\right) + (1-w) * \mu_0 \\ w = \frac{\frac{n_j}{\sigma^2_{y}}}{\frac{n_j}{\sigma^2_{y}} + \frac{1}{\sigma^2_{\beta_0}}} \]

💡 when will \(w\) be large? when will it be small?

  • when \(n_j\) is large, \(w\) will be large
  • when \(\sigma^2_{y}\) is small compared to \(\sigma^2_{\beta_0}\), \(w\) will be large
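A quick numeric sketch of the weight formula (the variances here are made up for illustration) confirms both bullets:

```r
# shrinkage weight w from the formula above
shrink_w <- function(n_j, sigma2_y, sigma2_b0) {
  (n_j / sigma2_y) / (n_j / sigma2_y + 1 / sigma2_b0)
}

shrink_w(n_j = 2,   sigma2_y = 100, sigma2_b0 = 25)  # small group: w ~ 0.33
shrink_w(n_j = 200, sigma2_y = 100, sigma2_b0 = 25)  # big group:   w ~ 0.98
```

With little data, \(1-w\) is large and the group estimate is pulled toward the population mean \(\mu_0\); with lots of data, the group’s own mean dominates.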

Fitting a MEM in R

library(lme4)

linear_reg <- lm(Reaction ~ Days, df_sleep)
summary(linear_reg)

Call:
lm(formula = Reaction ~ Days, data = df_sleep)

Residuals:
     Min       1Q   Median       3Q      Max 
-110.646  -27.951    1.829   26.388  139.875 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  252.321      6.406  39.389  < 2e-16 ***
Days          10.328      1.210   8.537 5.48e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 47.43 on 181 degrees of freedom
Multiple R-squared:  0.2871,    Adjusted R-squared:  0.2831 
F-statistic: 72.88 on 1 and 181 DF,  p-value: 5.484e-15

Fitting a MEM in R

random_int <- lmer(Reaction ~ Days + (1 | Subject), df_sleep)
summary(random_int)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 | Subject)
   Data: df_sleep

REML criterion at convergence: 1815.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.2361 -0.5406  0.0143  0.5204  4.2699 

Random effects:
 Groups   Name        Variance Std.Dev.
 Subject  (Intercept) 1296     36.00   
 Residual              952     30.85   
Number of obs: 183, groups:  Subject, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept) 252.5777     9.1033   27.75
Days         10.4319     0.7963   13.10

Correlation of Fixed Effects:
     (Intr)
Days -0.367

Fitting a MEM in R

random_int_slope <- lmer(Reaction ~ Days + (1 + Days | Subject), df_sleep)
summary(random_int_slope)
Linear mixed model fit by REML ['lmerMod']
Formula: Reaction ~ Days + (1 + Days | Subject)
   Data: df_sleep

REML criterion at convergence: 1771.4

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-3.9707 -0.4703  0.0276  0.4594  5.2009 

Random effects:
 Groups   Name        Variance Std.Dev. Corr
 Subject  (Intercept) 582.72   24.140       
          Days         35.03    5.919   0.07
 Residual             649.36   25.483       
Number of obs: 183, groups:  Subject, 20

Fixed effects:
            Estimate Std. Error t value
(Intercept)  252.543      6.433  39.257
Days          10.452      1.542   6.778

Correlation of Fixed Effects:
     (Intr)
Days -0.137

Reporting MEM Results

  • the average effect \(\mu\) : the expected effect for the population

  • the standard deviation/variance \(\sigma\): how variable this effect is across units

Survival Analysis

Time Til Repository Death

Survival Data

\[ \hat{\text{survival time}} = \beta_0 + \beta_1 * \text{treatment} + \beta_2 * \text{bp} \]


❓ do we actually know everyone’s survival time at the end of the study?

Censoring

\[ Y = \min(T,C) \\ \delta = \left\{\begin{matrix} 0& \text{if }\hspace{.1cm} T>C \\ 1& \text{if }\hspace{0.1cm} T \leq C\\\end{matrix}\right. \]
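A small simulation (the rates are made-up values) shows how the observed time \(Y\) and event indicator \(\delta\) are built:

```r
set.seed(2)

T_event  <- rexp(10, rate = 0.10) # true event times (not always observed)
C_censor <- rexp(10, rate = 0.05) # censoring times (end of study, dropout, ...)

Y     <- pmin(T_event, C_censor)         # we observe whichever comes first
delta <- as.numeric(T_event <= C_censor) # 1 = event observed, 0 = censored
```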

Kaplan-Meier Curves

\[ S(t) = Pr(T > t) \]

A decreasing curve showing how likely it is to survive past time \(t\) (in other words, the probability that the event time \(T\) exceeds \(t\))

Kaplan-Meier Curves

for all event times \(d_k\)

  • \(q_k\) is the number of subjects who had an event at \(d_k\).
  • \(r_k\) is the number of subjects who can have an event at \(d_k\) ( the “at risk” group).

\[ Pr(T > d_k) = \\ Pr(T > d_k | T > d_{k-1})Pr(T > d_{k-1}) + \\ \underbrace{Pr(T > d_k | T \leq d_{k-1})Pr(T \leq d_{k-1})}_\text{impossible} = \\ Pr(T > d_k | T > d_{k-1})Pr(T > d_{k-1}) \]

Kaplan-Meier Curves

\[ S(d_k) = Pr(T > d_k) = Pr(T > d_k | T > d_{k-1})\color{#D55E00}{Pr(T > d_{k-1})}\\ S(d_k) = Pr(T > d_k) = Pr(T > d_k | T > d_{k-1})\color{#D55E00}{S(d_{k-1}) } \]


Can be estimated by:

\[ \hat{Pr}(T > d_j | T >d_{j-1}) = \frac{r_j - q_j}{r_j} \]

Kaplan-Meier Curves

\[ S(d_k) = \prod_{j=1}^k \frac{r_j - q_j}{r_j} \]
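The product-limit formula can be checked by hand against survfit() from the survival package; the five observations here are made up:

```r
library(survival)

time   <- c(2, 3, 3, 5, 8) # made-up observed times
status <- c(1, 1, 0, 1, 1) # 1 = event, 0 = censored

# product-limit estimate computed directly from the formula
d_k  <- sort(unique(time[status == 1]))
S_km <- cumprod(sapply(d_k, function(d) {
  r <- sum(time >= d)                # at risk just before d
  q <- sum(time == d & status == 1)  # events at d
  (r - q) / r
}))

# matches the survival estimates from survfit at the event times
fit <- survfit(Surv(time, status) ~ 1)
```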

Cox Proportional Hazards

Hazard: force of mortality 😳 . Death rate the instant after time \(t\).

\[ h(t) = \lim_{\Delta t \to 0} \frac{Pr(t < T \leq t + \Delta t | T> t)}{\Delta t} \]


the higher \(h(t)\), the higher the probability of death.

Cox Proportional Hazards

\[ h(t) = \lim_{\Delta t \to 0} \frac{Pr(t < T \leq t + \Delta t | T> t)}{\Delta t} \]


Remember, \(P(A|B) = P(A \cap B) / P(B)\), so:

\[ h(t) = \lim_{\Delta t \to 0} \frac{Pr\left[(t < T \leq t + \Delta t) \cap (T> t)\right]/\Delta t}{Pr(T>t)} = \\ \lim_{\Delta t \to 0} \frac{Pr(t < T \leq t + \Delta t) /\Delta t}{Pr(T>t)} = \\ \frac{f(t)}{S(t)} \]

Cox Proportional Hazards

  • \(f(t)\) is the PDF of \(T\)
  • \(S(t)\) is the Survival Function
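One concrete check of \(h(t) = f(t)/S(t)\): for an exponentially distributed survival time, the hazard is constant and equal to the rate \(\lambda\) (the distribution is memoryless). A sketch with a made-up rate:

```r
lambda <- 0.3
t <- c(0.5, 1, 2, 5)

# hazard = density / survival function
h <- dexp(t, rate = lambda) / pexp(t, rate = lambda, lower.tail = FALSE)
h # constant: every entry equals lambda = 0.3
```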

Cox Proportional Hazards

Assumption of Proportional Hazards: \[ h(t|\mathbf{x}_i) = h_0(t)*e^{\beta \cdot \mathbf{x}_i} \]

  • \(h_0(t)\) is the hazard function for a person with all predictor values \(\mathbf{x}_i =0\)

  • \(e^{\beta \cdot \mathbf{x}_i}\) is the relative risk of a person with predictors \(\mathbf{x}_i\) compared to a person with predictors \(\mathbf{x}_i = 0\)

Cox Proportional Hazards

  • a 1-unit increase in \(x_{ij}\) corresponds to a change in \(h(t | \mathbf{x}_i)\) by a factor of \(e^{\beta_j}\) (multiplicative)

figure from: An Introduction to Statistical Learning with Applications in R, 2nd Edition

Cox Proportional Hazards

library(ISLR2)
attach(BrainCancer)
head(BrainCancer,3)
     sex  diagnosis            loc ki   gtv stereo status  time
1 Female Meningioma Infratentorial 90  6.11    SRS      0 57.64
2   Male  HG glioma Supratentorial 90 19.35    SRT      1  8.98
3 Female Meningioma Infratentorial 70  7.95    SRS      0 26.46

Cox Proportional Hazards

❓ Does sex impact survival time for people with primary brain tumors undergoing treatment with stereotactic radiation methods?

library(survival)

# KM curve
fit_surv <- survfit(Surv(time , status) ~ sex)
plot(fit_surv, xlab = "Months", ylab = "Estimated Probability of Survival",col = c(2,4))
legend("bottomleft", levels(sex), col = c(2,4), lty = 1)

Cox Proportional Hazards

The risk associated with being male is 1.5x (0.77x-2.94x) the risk associated with being female.

fit_cox <- coxph(Surv(time , status) ~ sex)
summary(fit_cox)
Call:
coxph(formula = Surv(time, status) ~ sex)

  n= 88, number of events= 35 

          coef exp(coef) se(coef)     z Pr(>|z|)
sexMale 0.4077    1.5033   0.3420 1.192    0.233

        exp(coef) exp(-coef) lower .95 upper .95
sexMale     1.503     0.6652     0.769     2.939

Concordance= 0.565  (se = 0.045 )
Likelihood ratio test= 1.44  on 1 df,   p=0.2
Wald test            = 1.42  on 1 df,   p=0.2
Score (logrank) test = 1.44  on 1 df,   p=0.2

Cox Proportional Hazards

fit_cox2 <- coxph(Surv(time , status) ~ sex + diagnosis +
                    loc + ki + gtv + stereo)
summary(fit_cox2)
Call:
coxph(formula = Surv(time, status) ~ sex + diagnosis + loc + 
    ki + gtv + stereo)

  n= 87, number of events= 35 
   (1 observation deleted due to missingness)

                       coef exp(coef) se(coef)      z Pr(>|z|)    
sexMale             0.18375   1.20171  0.36036  0.510  0.61012    
diagnosisLG glioma  0.91502   2.49683  0.63816  1.434  0.15161    
diagnosisHG glioma  2.15457   8.62414  0.45052  4.782 1.73e-06 ***
diagnosisOther      0.88570   2.42467  0.65787  1.346  0.17821    
locSupratentorial   0.44119   1.55456  0.70367  0.627  0.53066    
ki                 -0.05496   0.94653  0.01831 -3.001  0.00269 ** 
gtv                 0.03429   1.03489  0.02233  1.536  0.12466    
stereoSRT           0.17778   1.19456  0.60158  0.296  0.76760    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

                   exp(coef) exp(-coef) lower .95 upper .95
sexMale               1.2017     0.8321    0.5930    2.4352
diagnosisLG glioma    2.4968     0.4005    0.7148    8.7215
diagnosisHG glioma    8.6241     0.1160    3.5664   20.8546
diagnosisOther        2.4247     0.4124    0.6678    8.8031
locSupratentorial     1.5546     0.6433    0.3914    6.1741
ki                    0.9465     1.0565    0.9132    0.9811
gtv                   1.0349     0.9663    0.9906    1.0812
stereoSRT             1.1946     0.8371    0.3674    3.8839

Concordance= 0.794  (se = 0.04 )
Likelihood ratio test= 41.37  on 8 df,   p=2e-06
Wald test            = 38.7  on 8 df,   p=6e-06
Score (logrank) test = 46.59  on 8 df,   p=2e-07

IRT

Spelling Test

Latent Variable Modeling

  • difficulty of item

  • spelling ability of person


Neither of these things is observed, but we can estimate them using the observed correctness of spelling test items.

Example

💡 Think back to all the tests you’ve taken:

  • on a given test, does every item have the same difficulty?

  • which questions do you think give the prof the most information about your ability?

  • even if you’ve studied the topic of a question well, do you always get it correct?

Logistic Regression Review

\[ \mathbb{E}\left( y | \mathbf{X} \right) = g^{-1}\left(\mathbf{X}\beta \right) \\g\left(p \right) = \log\left( \frac{p}{1-p} \right); g^{-1}\left(x \right) = \frac{1}{1+e^{-x}} \\y | \mathbf{X} \sim \text{Bernoulli}\left( g^{-1}\left(\mathbf{X}\beta\right) \right) \]

Predicting a binary outcome (correct/incorrect) using various predictors \(\mathbf{X}\).

\[ \mathbf{X}\beta = \beta_0 * 1 + \beta_1 * x_1 + ... + \beta_p * x_p \]

IRT Logistic Curves

Rasch (1PL) Model

Rasch (1PL) Model

\[ P(Y_j = 1 | \theta) = \frac{e^{\theta-\beta_j}}{1 + e^{\theta - \beta_j}} \]

in other words

\[ \mathbb{E}\left( y | \theta \right) = g^{-1}(\theta - \beta_j) \]

  • \(\theta\): latent trait value (ability) for person

  • \(\beta_j\): difficulty of item
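The Rasch item response function is just an inverse-logit of \(\theta - \beta_j\); a minimal sketch:

```r
# P(correct) under the Rasch (1PL) model
p_rasch <- function(theta, beta_j) 1 / (1 + exp(-(theta - beta_j)))

p_rasch(theta = 0, beta_j = 0) # ability matches difficulty -> 0.5
p_rasch(theta = 2, beta_j = 0) # high ability, average item  -> > 0.5
p_rasch(theta = 0, beta_j = 2) # average ability, hard item  -> < 0.5
```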

2PL Model

2PL Model

\[ P(Y_j = 1 | \theta) = \frac{e^{\alpha_j\left[\theta-\beta_j\right]}}{1 + e^{\alpha_j\left[\theta-\beta_j\right]}} \]

  • \(\theta\): latent trait value (ability) for person

  • \(\beta_j\): difficulty of item

  • \(\alpha_j\): discrimination of item


the higher the discrimination, the more information an item gives you about someone’s latent trait value.

3PL Model

3PL

\[ P(Y_j = 1 | \theta) = c_j + (1-c_j) *\frac{e^{\alpha_j\left[\theta-\beta_j\right]}}{1 + e^{\alpha_j\left[\theta-\beta_j\right]}} \]

  • \(\theta\): latent trait value (ability) for person

    • Standard Normally distributed
  • \(\beta_j\): difficulty of item

  • \(\alpha_j\): discrimination of item

  • \(c_j\): “guessability” of item
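The 1PL and 2PL fall out of the 3PL as special cases (\(\alpha_j = 1\) and/or \(c_j = 0\)); a minimal sketch with made-up parameter values:

```r
# 3PL item response function; the defaults recover the Rasch (1PL) model
p_3pl <- function(theta, beta_j, alpha_j = 1, c_j = 0) {
  c_j + (1 - c_j) / (1 + exp(-alpha_j * (theta - beta_j)))
}

p_3pl(theta = -4, beta_j = 0, alpha_j = 1.5, c_j = 0.25)
# ~0.25: even very low ability can't fall below the guessing floor c_j
```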

Sum Scores vs. Weighted Scores

When grading exams you can either:

  • sum score: weight each question equally

  • weighted score: weight each question differently


Sum scores are very common on tests, surveys. But they make the assumption that each item gives you the same amount of information.

Sum Scores vs. Weighted Scores

Option: weight by \(\alpha_j\)

\[ \text{total score} = \sum_{j=1}^n \alpha_jx_{ij} \]

where \(\alpha_j\) is the discrimination parameter for item \(j\) and \(x_{ij}\) is a binary variable indicating person \(i\)’s correctness on item \(j\).

The steeper the slope of the ICC, the more information an item gives us about someone’s ability, so the higher weight it gets in our score.
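A tiny sketch with made-up discriminations and one person’s responses:

```r
alpha <- c(0.5, 1.8, 1.0, 2.2) # made-up discrimination parameters
x_i   <- c(1,   1,   0,   1)   # person i's correct (1) / incorrect (0) responses

sum_score      <- sum(x_i)         # every item counts equally
weighted_score <- sum(alpha * x_i) # high-discrimination items count more
c(sum_score, weighted_score)
```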

Item Characteristic Curve

Item Information Curve

IIC: how much information an item gives about different latent trait values (closely tied to the slope of the ICC).

Test Information Curve

Test Information Curves

Test Information Curves

Test Information Curves

Test Information Curve

Applications of IRT

  • Survey Design: e.g. A Survey about externship success in Chapman’s Law School, a survey measuring Depression

  • Standardized Tests: e.g. the GRE/SAT/ACT

  • Lab Assessments: e.g. people’s ratings of their own memory

Applications of IRT

IRT in R

library(mirt)
set.seed(540)
# sim_irt_df() is a course helper (not part of mirt) that simulates binary item responses
d <- sim_irt_df(n_people = 30, n_items = 10)

fit2PL <- mirt(data = d, 
               model = 1, # one dimensional latent trait
               itemtype = "2PL", 
               verbose = FALSE)
summary(fit2PL)
           F1    h2
Item_1  0.351 0.123
Item_2  0.547 0.300
Item_3  0.435 0.189
Item_4  0.922 0.849
Item_5  0.472 0.222
Item_6  0.863 0.744
Item_7  0.597 0.356
Item_8  0.602 0.363
Item_9  0.524 0.275
Item_10 0.243 0.059

SS loadings:  3.48 
Proportion Var:  0.348 

Factor correlations: 

   F1
F1  1

IRT in R

fitRasch <- mirt(data = d, 
               model = 1, # one dimensional latent trait
               itemtype = "Rasch", 
               verbose = FALSE)
summary(fitRasch)
           F1    h2
Item_1  0.572 0.327
Item_2  0.572 0.327
Item_3  0.572 0.327
Item_4  0.572 0.327
Item_5  0.572 0.327
Item_6  0.572 0.327
Item_7  0.572 0.327
Item_8  0.572 0.327
Item_9  0.572 0.327
Item_10 0.572 0.327

SS loadings:  3.266 
Proportion Var:  0.327 

Factor correlations: 

   F1
F1  1

IRT in R

plot(fit2PL, type = "trace", facet_items = FALSE)

plot(fit2PL, type = "infotrace", facet_items = FALSE)

plot(fit2PL, type = "info")

Sneak Peek: Why Bayesian IRT?

💡 if each item has 1-3 parameters (\(\alpha_j, \beta_j, c_j\)) , and each person has 1 parameter (\(\theta_i\)), what happens to the number of parameters as items and sample size increases?